Abstract: In this letter, we derive the optimal discriminant
functions for modulation classification based on the sampled 
distribution distance. The proposed method classifies various 
candidate constellations using a low complexity approach based 
on the distribution distance at specific testpoints along the 
cumulative distribution function. This method, based on the
Bayesian decision criterion, asymptotically provides the minimum
classification error possible given a set of testpoints. Testpoint
locations are also optimized to improve classification performance.
The method provides significant gains over existing approaches 
that also use the distribution of the signal features. 



I. Introduction 

Modulation classification is the process of choosing the 
most likely scheme from a set of predefined candidate schemes 
that a received signal could belong to. Various approaches 
have been proposed to address this problem. There has re- 
cently been growing interest in modulation classification for 
applications such as software defined radio, cognitive radio 
and interference identification [1].

Existing classification methods can generally be categorized 
into two main groups: feature based classifiers and likelihood 
based (ML) classifiers. The ML classifiers give the minimum 
possible classification error of all possible discriminant func- 
tions given perfect knowledge of the signal's probability dis- 
tribution. However, this approach is very sensitive to modeling 
errors such as imperfect knowledge of the signal to noise ratio 
(SNR) or phase offset. Further, such approaches have very high 
computational complexity and are thus impractical in actual 
hardware implementation. To address these issues, various
feature based techniques such as cumulant-based classifiers [2]
and cyclostationary-based classifiers [3] have been proposed.

Recently, Goodness-of-Fit (GoF) tests such as the
Kolmogorov-Smirnov (KS) [4] distribution distance have been
proposed to identify the constellation used in QAM modulation
[5]. Based on the KS classifier, we proposed a new reduced
complexity Kuiper (rcK) classifier in [6]. The rcK classifier
evaluates the empirical cumulative distribution function (ECDF)
only at a small set of predetermined testpoints that have
the highest probability of giving the maximum distribution
distance, effectively sampling the distribution function. The
algorithm offered reduced computational complexity by removing
the need to estimate the full ECDF while still providing
better performance than the KS classifier. It also increased the
robustness of the classifier to imperfect parameter estimates.

The idea of improving the classification accuracy of the rcK
classifier by using more testpoints was proposed in [7]. The
method is referred to as the Variational Distance (VD) classifier,
where testpoints are selected to be the pdf-crossings of the two
classes being recognized. The sum of the absolute distances
is then used as the final discriminating statistic. We refer to
methods such as rcK and VD, that utilize the value of the 
ECDF at a small number of testpoints, as sampled distribu- 
tion distance classifiers. In this work we derive the optimal 
discriminant functions for classification with the sampled 
distribution distance given a set of testpoint locations. We 
also provide a systematic way of finding testpoint locations 
that provide near optimal performance by maximizing the 
Bhattacharyya distance between classes. Finally, we present 
results that compare the performance of this approach with 
existing techniques. 

II. Proposed Classifier 

A. System Model 

Following [5], we assume a sequence of $M$ discrete, complex,
i.i.d., sampled baseband symbols,
$\mathbf{s}^{(k)} = [s_1^{(k)} \cdots s_M^{(k)}]$, drawn from a
constellation $\mathcal{M}_k \in \{\mathcal{M}_1, \ldots, \mathcal{M}_K\}$
and transmitted over an AWGN channel. The received signal under
constellation $\mathcal{M}_k$ is $\mathbf{r} = [r_1 \cdots r_M]$,
where $r_n = s_n^{(k)} + g_n$ and $g_n \sim \mathcal{CN}(0, \sigma^2)$.
We further define the SNR as $E[|s_n^{(k)}|^2]/\sigma^2$. The task
of the modulation classifier is to find the $\mathcal{M}_k$ from
which $\mathbf{r}$ is drawn. Without loss of generality, we
consider unit power constellations.
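The system model above can be simulated in a few lines. The following is a minimal sketch, assuming square QAM constellations and NumPy; the function names `qam_constellation` and `received_symbols` are illustrative and do not come from the letter.

```python
import numpy as np

def qam_constellation(order):
    """Square QAM constellation normalized to unit average power."""
    m = int(round(np.sqrt(order)))
    levels = np.arange(-(m - 1), m, 2, dtype=float)      # e.g. [-3, -1, 1, 3]
    points = (levels[:, None] + 1j * levels[None, :]).ravel()
    return points / np.sqrt(np.mean(np.abs(points) ** 2))

def received_symbols(order, num_symbols, snr_db, rng):
    """r_n = s_n + g_n with i.i.d. symbols and circular complex Gaussian noise."""
    const = qam_constellation(order)
    s = rng.choice(const, size=num_symbols)
    sigma2 = 10.0 ** (-snr_db / 10.0)   # unit signal power, so sigma^2 = 1 / SNR
    g = np.sqrt(sigma2 / 2.0) * (rng.standard_normal(num_symbols)
                                 + 1j * rng.standard_normal(num_symbols))
    return s + g

rng = np.random.default_rng(0)
r = received_symbols(16, 200, 0.0, rng)   # 16-QAM, M = 200 symbols, SNR = 0 dB
```

Because the constellation has unit power, the noise variance follows directly from the SNR, which matches the normalization assumed in the text.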



B. Classification Based on Sampled Distribution Distance 

Let $\mathbf{z} = [z_1 \cdots z_N] = f(\mathbf{r})$, where $f(\cdot)$
is the chosen mapping from the received symbols $\mathbf{r}$ to the
extracted feature vector $\mathbf{z}$ and $N$ is the length of the
feature vector. Possible feature maps include $|\mathbf{r}|$
(magnitude, $N = M$), the concatenation of $\Re\{\mathbf{r}\}$ and
$\Im\{\mathbf{r}\}$ (quadrature, $N = 2M$), and the phase information
$\angle\mathbf{r}$ (angle, $N = M$), among others. The theoretical CDF
of $z_n$ given $\mathcal{M}_k$ and $\sigma^2$, denoted as $F_0^k(z)$,
is assumed to be known a priori (methods of obtaining these
distributions, both empirically and theoretically, are presented in
[5, Sec. III-A]).
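As an illustration of the feature maps and of one theoretical CDF, the sketch below assumes unit-power square QAM in AWGN, for which each quadrature sample is a uniform mixture of Gaussians centred on the per-dimension amplitude levels. The function names are hypothetical, and the closed form is a standard result used here as an assumption, not the derivation of [5].

```python
import math
import numpy as np

def magnitude_feature(r):
    return np.abs(r)                              # N = M

def quadrature_feature(r):
    return np.concatenate([r.real, r.imag])       # N = 2M

def phase_feature(r):
    return np.angle(r)                            # N = M

def quadrature_cdf(z, order, sigma2):
    """Theoretical CDF of one quadrature sample for unit-power square QAM."""
    m = int(round(math.sqrt(order)))
    levels = np.arange(-(m - 1), m, 2, dtype=float)
    levels /= math.sqrt(2.0 * np.mean(levels ** 2))   # each dimension carries power 1/2
    scale = math.sqrt(sigma2)   # sqrt(2) times the per-dimension noise std sqrt(sigma2 / 2)
    z = np.atleast_1d(np.asarray(z, dtype=float))
    return np.array([
        np.mean([0.5 * (1.0 + math.erf((zi - a) / scale)) for a in levels])
        for zi in z
    ])

F = quadrature_cdf([-1.0, 0.0, 1.0], order=16, sigma2=1.0)
```

By symmetry of the constellation the CDF equals 0.5 at zero, which gives a quick sanity check on the normalization.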

In this paper we focus on algorithms that use the ECDF,
defined as

$$F_N(t) = \frac{1}{N} \sum_{n=1}^{N} \mathbb{I}(z_n \le t), \tag{1}$$

as the discriminating feature for classification. Here, $\mathbb{I}(\cdot)$ is the
indicator function whose value is 1 if the function argument
is true, and 0 otherwise. If the complete ECDF resulting
from the entire feature vector, $\mathbf{z}$, is used for classification, we
get the conventional distribution distance measures such as
Kuiper, Kolmogorov-Smirnov, Anderson-Darling and others.
Details of these measures are discussed in [4]. Once the
ECDF is found and the appropriate distribution distance is
calculated, the candidate constellation with minimum distance
is chosen. However, prior work in [6], [7] has shown that
improved classification accuracy can be achieved at much
lower computational complexity and with increased model
robustness by finding the value of the ECDF at a small number
of specific testpoints.

We describe these methods formally by defining a set of
$L$ testpoints, $\mathbf{t} = [t_1 \cdots t_L]$, with $t_{i+1} > t_i$. For notational
consistency, we also define the virtual testpoints
$t_0 = -\infty$ and $t_{L+1} = +\infty$ in addition to $\mathbf{t}$. Evaluating the
ECDF from (1) at $\mathbf{t}$ gives us $\mathbf{x} = [x_1 \cdots x_L]$, $x_i = F_N(t_i)$.
We refer to any classifier that utilizes the feature vector $\mathbf{x}$ as a
sampled distribution distance-based classifier. As an example,
the variational distance (VD) classifier from [7] proposed
forming $\mathbf{t}$ from the points that give either a local maximum
or minimum of the difference between the two theoretical CDFs of
the candidate classes. Instead of using the sampled ECDF
directly, the VD classifier finds the number of samples that fall
between two consecutive testpoints, which is equivalent to
taking the difference of the ECDF at consecutive testpoints,
$F_N(t_i) - F_N(t_{i-1})$.
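The sampled ECDF and the per-region counts used by VD-style classifiers can be computed directly from the feature vector. This is a minimal sketch; the names `sampled_ecdf` and `region_counts` are illustrative, and the padding values 0 and 1 correspond to the ECDF at the virtual testpoints $t_0$ and $t_{L+1}$.

```python
import numpy as np

def sampled_ecdf(z, t):
    """x_i = F_N(t_i): fraction of samples z_n <= t_i, for sorted testpoints t."""
    z = np.asarray(z, dtype=float)
    t = np.asarray(t, dtype=float)
    return (z[None, :] <= t[:, None]).mean(axis=1)

def region_counts(x, n_samples):
    """Counts between consecutive testpoints: N * (x_i - x_{i-1})."""
    padded = np.concatenate([[0.0], x, [1.0]])   # ECDF at t_0 = -inf and t_{L+1} = +inf
    return np.rint(n_samples * np.diff(padded)).astype(int)

z = np.array([-1.2, -0.3, 0.1, 0.4, 0.9, 1.5])
t = np.array([-0.5, 0.5])
x = sampled_ecdf(z, t)          # one value per testpoint
n = region_counts(x, len(z))    # one count per region, L + 1 in total
```

Only $L$ comparisons per sample are needed, so the full ECDF never has to be sorted or stored.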

In this paper our goal is to optimize the classification 
accuracy of the sampled distribution distance classification 
approach defined as 
Intuitively, there are two ways to improve $P_C$. First, since
different testpoints have varying distribution distance, it is 
expected that different weights should be assigned to each 
testpoint. Second, the choice of the number and location of 
the points along the ECDF should also be investigated to 
find the proper balance between complexity and classification 
accuracy. Both of these improvements are addressed in the 
following subsection. 

C. Proposed Classifier 

We first assume that $\mathbf{t}$ has been selected a priori and our
goal is to find the optimal classifier for the resulting feature
vector $\mathbf{x}$. We want to find a discriminant function $g_k(\mathbf{x})$,
$k \in [1, K]$, for every candidate constellation $\mathcal{M}_k$, where
we follow the rule:

Choose $\mathcal{M}_i$ s.t. $g_i(\mathbf{x}) > g_j(\mathbf{x}) \;\; \forall j \neq i$. (3)



It is well established in decision theory that if the performance
metric used is average classification error, the optimal
classifier is based on the Bayes decision procedure [8]. This
procedure can be stated as:

Choose $\mathcal{M}_i$ s.t. $\Pr(\mathcal{M}_i | \mathbf{x}) > \Pr(\mathcal{M}_j | \mathbf{x}) \;\; \forall j \neq i$. (4)



Using the prior probabilities $\Pr(\mathcal{M}_i)$, the posterior
probabilities $\Pr(\mathcal{M}_i | \mathbf{x})$ can be found from $\Pr(\mathbf{x} | \mathcal{M}_i)$ using
Bayes' formula. Thus, finding the pdf of the feature vector
conditioned on the modulation scheme, $\Pr(\mathbf{x} | \mathcal{M}_i)$, effectively
gives us the optimal classifier in the minimum error rate sense.

The testpoints partition $\mathbf{z}$ into $L + 1$ regions. An individual
sample, $z_n$, falls in region $i$, such that $t_{i-1} < z_n \le t_i$, with
a given probability, completely determined by the CDF, $F_0^k(z)$.
The number of samples that fall into each of the regions,
$\mathbf{n} = [n_1 \cdots n_{L+1}]$, where $n_i$ corresponds to region $i$,
$1 \le i \le L+1$, is jointly distributed according to a multinomial
probability mass function (pmf) given as

$$f(\mathbf{n} | N, \mathbf{p}) = \frac{N!}{n_1! \cdots n_{L+1}!} \, p_1^{n_1} \cdots p_{L+1}^{n_{L+1}}, \tag{5}$$

where $\mathbf{p} = [p_1 \cdots p_{L+1}]$, and $p_i$ is the probability of an
individual sample being in region $i$. Given that $\mathbf{z}$ is drawn
from $\mathcal{M}_k$, $p_i = F_0^k(t_i) - F_0^k(t_{i-1})$ for $0 < i \le L + 1$.

Given a particular $\mathbf{x}$, the number of samples in each of
the $L + 1$ regions can be found as $n_i = N(x_i - x_{i-1})$,
where $x_0 = 0$ and $x_{L+1} = 1$. This gives a mapping from any
given $\mathbf{x}$ to $\mathbf{n}$ and therefore to the pmf $f(\mathbf{n} | N, \mathbf{p})$ as defined
in (5). Therefore we have the complete class-conditional pdf,
$\Pr(\mathbf{x} | \mathcal{M}_k)$, with $\mathbf{p}$ in (5) determined by $F_0^k(z)$, the CDF of
class $\mathcal{M}_k$. Thus we have the optimal classifier. We will refer
to $\mathbf{x}$ and $\mathbf{n}$ conditioned on class $\mathcal{M}_k$ as $\mathbf{x}^{(k)}$ and $\mathbf{n}^{(k)}$.
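The exact multinomial classifier described above can be sketched as follows, working with log-probabilities to avoid overflow in the factorials. The function names and the toy CDF values are assumptions made for illustration, and the sketch assumes every region probability is strictly positive.

```python
import math
import numpy as np

def multinomial_log_pmf(n, p):
    """log f(n | N, p) for region counts n and region probabilities p (all p > 0)."""
    n = np.asarray(n, dtype=float)
    p = np.asarray(p, dtype=float)
    log_coeff = math.lgamma(n.sum() + 1.0) - sum(math.lgamma(k + 1.0) for k in n)
    return log_coeff + float(np.sum(n * np.log(p)))

def classify_multinomial(x, n_samples, class_cdf_at_t, priors):
    """Pick the class maximizing log Pr(x | class) + log Pr(class)."""
    n = np.rint(n_samples * np.diff(np.concatenate([[0.0], x, [1.0]])))
    scores = []
    for cdf_t, prior in zip(class_cdf_at_t, priors):
        p = np.diff(np.concatenate([[0.0], cdf_t, [1.0]]))   # p_i = F(t_i) - F(t_{i-1})
        scores.append(multinomial_log_pmf(n, p) + math.log(prior))
    return int(np.argmax(scores))

# Toy example: theoretical CDF values of two hypothetical classes at L = 2 testpoints.
cdfs = [np.array([0.2, 0.8]), np.array([0.4, 0.6])]
label = classify_multinomial(np.array([0.2, 0.8]), 100, cdfs, [0.5, 0.5])
```

The multinomial coefficient is identical for all classes and could be dropped from the comparison; it is kept here so that the function returns the full log-pmf.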

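Although the multinomial pmf in (5) can be used for minimum
error rate classification, its calculation is very computationally
intensive. To address this issue we note that, asymptotically,
the multinomial pmf $f(\mathbf{n} | N, \mathbf{p})$ in (5) approaches a
multivariate Gaussian distribution,
$\mathbf{n}^{(k)} \sim \mathcal{N}(\boldsymbol{\mu}_k^{*}, \boldsymbol{\Sigma}_k^{*})$ as $N \to \infty$, where

$$[\boldsymbol{\mu}_k^{*}]_i = N p_i, \tag{6}$$

$$[\boldsymbol{\Sigma}_k^{*}]_{ij} = \begin{cases} N p_i (1 - p_i), & i = j, \\ -N p_i p_j, & i \neq j. \end{cases} \tag{7}$$

Since $\mathbf{x}$ is simply the scaled cumulative sum of $\mathbf{n}$ (i.e.,
$x_i = \frac{1}{N} \sum_{l=1}^{i} n_l$), which is a linear operation, it follows that
$\mathbf{x}^{(k)} \sim \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$, where

$$[\boldsymbol{\mu}_k]_i = \frac{1}{N} \sum_{l=1}^{i} [\boldsymbol{\mu}_k^{*}]_l, \tag{8}$$

$$[\boldsymbol{\Sigma}_k]_{ij} = \frac{1}{N^2} \sum_{l=1}^{i} \sum_{m=1}^{j} [\boldsymbol{\Sigma}_k^{*}]_{lm}. \tag{9}$$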
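The Gaussian moments of the sampled ECDF can be checked numerically. The sketch below builds the multinomial count moments, applies the linear cumulative-sum map, and returns the resulting mean and covariance; the function names are illustrative. Carrying out the double sum by hand gives the compact form $F(t_i)\,(1 - F(t_j))/N$ for $i \le j$, which is stated here as a derived consequence rather than something taken from the letter.

```python
import numpy as np

def count_moments(p, n_samples):
    """Mean and covariance of the multinomial region counts n."""
    p = np.asarray(p, dtype=float)
    return n_samples * p, n_samples * (np.diag(p) - np.outer(p, p))

def sampled_ecdf_moments(cdf_at_t, n_samples):
    """Mean and covariance of x, obtained through the linear map x = A n."""
    F = np.asarray(cdf_at_t, dtype=float)
    L = len(F)
    p = np.diff(np.concatenate([[0.0], F, [1.0]]))
    mean_n, cov_n = count_moments(p, n_samples)
    # Row i of A sums the first i + 1 counts and divides by N (last column unused).
    A = np.tril(np.ones((L, L + 1))) / n_samples
    return A @ mean_n, A @ cov_n @ A.T

F = np.array([0.2, 0.5, 0.9])     # hypothetical CDF values at three testpoints
mu, cov = sampled_ecdf_moments(F, n_samples=400)
closed_form = np.minimum.outer(F, F) * (1.0 - np.maximum.outer(F, F)) / 400
```

The mean reduces to the theoretical CDF evaluated at the testpoints, so only the CDF values and $N$ are needed to build both moments.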



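Having shown that the feature vector $\mathbf{x}$ is asymptotically
Gaussian distributed, we can proceed to apply the Bayes
decision procedure in (4). However, the full multivariate pdfs
are not required to perform classification because the optimal
discriminant functions for Gaussian feature vectors are known
to be quadratic with the following form [8]:

$$g_k(\mathbf{x}) = \mathbf{x}^T \mathbf{W}_k \mathbf{x} + \mathbf{w}_k^T \mathbf{x} + w_{k0}, \tag{10}$$

where

$$\mathbf{W}_k = -\tfrac{1}{2} \boldsymbol{\Sigma}_k^{-1}, \qquad \mathbf{w}_k = \boldsymbol{\Sigma}_k^{-1} \boldsymbol{\mu}_k, \tag{11}$$

and

$$w_{k0} = -\tfrac{1}{2} \boldsymbol{\mu}_k^T \boldsymbol{\Sigma}_k^{-1} \boldsymbol{\mu}_k - \tfrac{1}{2} \ln |\boldsymbol{\Sigma}_k| + \ln \Pr(\mathcal{M}_k). \tag{12}$$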

In the following sections we will simply refer to this classifier 
as the Bayesian approach. 
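A minimal sketch of the quadratic discriminant is given below. The helper `ecdf_gaussian`, which builds the class mean and covariance from CDF values at the testpoints, and the toy CDF values are assumptions made for illustration; the discriminant itself follows the standard quadratic form for Gaussian classes.

```python
import numpy as np

def ecdf_gaussian(cdf_at_t, n_samples):
    """Asymptotic mean and covariance of the sampled ECDF for one class."""
    F = np.asarray(cdf_at_t, dtype=float)
    cov = np.minimum.outer(F, F) * (1.0 - np.maximum.outer(F, F)) / n_samples
    return F, cov

def discriminant_params(mu, sigma, prior):
    """Precompute the quadratic, linear and constant terms for one class."""
    sigma_inv = np.linalg.inv(sigma)
    _, logdet = np.linalg.slogdet(sigma)
    W = -0.5 * sigma_inv
    w = sigma_inv @ mu
    w0 = -0.5 * mu @ sigma_inv @ mu - 0.5 * logdet + np.log(prior)
    return W, w, w0

def classify(x, params):
    """Return the index of the class with the largest discriminant value."""
    scores = [x @ W @ x + w @ x + w0 for W, w, w0 in params]
    return int(np.argmax(scores))

# Two hypothetical classes described by their CDF values at L = 2 testpoints.
classes = [ecdf_gaussian([0.2, 0.8], 100), ecdf_gaussian([0.4, 0.6], 100)]
params = [discriminant_params(mu, sigma, 0.5) for mu, sigma in classes]
label = classify(np.array([0.21, 0.79]), params)
```

All matrix inversions happen offline in `discriminant_params`; at run time each class costs one quadratic form of size $L$.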

D. Note on Implementation 

Similar to rcK [6] and VD [7], the Bayesian approach only
needs to store the testpoint locations for a fixed set of SNRs
since the theoretical CDF is dependent on SNR. Given a $\mathbf{t}$ of size
$L$, VD and rcK require both $\mathbf{t}$ and $\boldsymbol{\mu}_k$ for each class $\mathcal{M}_k$. In
contrast, the Bayesian approach requires the same vector $\mathbf{t}$, an
$L \times L$ matrix $\mathbf{W}_k$, a vector $\mathbf{w}_k$ of size $L$, and a scalar $w_{k0}$ for
each class $\mathcal{M}_k$. However, there are typically no more than 12
testpoints (total number of pdf-crossings), so this additional
storage requirement is negligible. The Bayesian approach
also requires the calculation of the quadratic form in
(10). Again, because only a relatively small number
of testpoints is used, the additional complexity is minimal.

E. Testpoint Selection 

In this subsection we present a method for choosing testpoint
locations, $\mathbf{t}$, that provide good classification performance.
The method of using the pdf-crossings makes intuitive
sense, since it tries to find the testpoints that provide the
maximum difference in the theoretical CDF while providing
a heuristic rule that keeps the testpoints spaced apart.
Testpoints that are too close to each other are not as effective
because the ECDF values at those points tend to be highly
correlated and thus provide minimal additional information.

Another issue with using the pdf-crossings is that it does
not factor in knowledge of the correlation between testpoints.
As we have shown in Section II-C, the feature vector $\mathbf{x}$ follows
an approximate multivariate Gaussian distribution with statistics given in
(8) and (9). Therefore, the class-conditional means $\boldsymbol{\mu}_k$ and
covariance matrices $\boldsymbol{\Sigma}_k$ are sufficient to completely describe
the distribution of the feature vectors conditioned on $\mathcal{M}_k$.
Thus, these statistics are also sufficient to find the optimal
testpoint locations, $\mathbf{t}^*$.

However, since the $\boldsymbol{\Sigma}_k$ are clearly not equal for all $\mathcal{M}_k$, a
closed form expression for the classification accuracy for this
problem does not exist. Instead, a $K$-dimensional integration is
required and the limits, determined by the decision boundaries
defined by (10), are non-trivial. As is typically done in this
scenario, we replace the exact $P_C$ with a sub-optimum distance
metric that is easier to evaluate and does not require a
$K$-dimensional integral. In particular we use the Bhattacharyya
distance, first studied for signal selection in [9] and shown to be
very effective as a "goodness" criterion in the process
of selecting effective features to be used in classification. The
metric is shown here for reference:

$$D_B = \frac{1}{8} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) + \frac{1}{2} \ln \left( \frac{|(\boldsymbol{\Sigma}_1 + \boldsymbol{\Sigma}_2)/2|}{\sqrt{|\boldsymbol{\Sigma}_1| |\boldsymbol{\Sigma}_2|}} \right), \tag{13}$$

where $\boldsymbol{\Sigma} = (\boldsymbol{\Sigma}_1 + \boldsymbol{\Sigma}_2)/2$.

Fig. 1. Optimized testpoint locations for varying number of testpoints, $L$.
The solid line shows the CDF difference between the two classes (4-QAM
and 16-QAM, SNR = 0 dB, $M$ = 200). Horizontal axis: testpoint locations.

Note that the Bhattacharyya distance is calculated between
2 classes. As a result, the search for testpoints can only be
performed for the $K = 2$ case. However, this can be done
sequentially through all the possible pairs of $\mathcal{M}_k$. As $D_B$ is
a function of $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$, which are functions of our testpoint
selection, $\mathbf{t}$, we can express it as $D_B(\mathbf{t})$. We thus find
good candidate testpoints by

$$\mathbf{t}^* = \arg\max_{\mathbf{t}} D_B(\mathbf{t}), \tag{14}$$

under the constraint $t_{i+1} > t_i$.
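The Bhattacharyya distance between two Gaussian classes is straightforward to evaluate once the class means and covariances are available. The sketch below uses log-determinants for numerical stability; the function name is illustrative.

```python
import numpy as np

def bhattacharyya_distance(mu1, sigma1, mu2, sigma2):
    """Bhattacharyya distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    sigma = 0.5 * (sigma1 + sigma2)
    diff = mu1 - mu2
    term_mean = 0.125 * diff @ np.linalg.solve(sigma, diff)
    _, logdet = np.linalg.slogdet(sigma)
    _, logdet1 = np.linalg.slogdet(sigma1)
    _, logdet2 = np.linalg.slogdet(sigma2)
    term_cov = 0.5 * (logdet - 0.5 * (logdet1 + logdet2))
    return float(term_mean + term_cov)

eye = np.eye(2)
d_same = bhattacharyya_distance(np.zeros(2), eye, np.zeros(2), eye)
d_shift = bhattacharyya_distance(np.zeros(2), eye, np.array([2.0, 0.0]), eye)
```

With equal covariances only the first term survives, so the distance reduces to one eighth of the squared Mahalanobis distance between the means.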

As this is an $L$-dimensional optimization problem, a closed-form
solution is beyond the scope of this letter. Instead,
we turn to numerical optimization methods (gradient descent
methods) to find local maxima. The initial point of these
procedures could be chosen to coincide with the pdf-crossings
or be equally spaced over some interval.
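One possible numerical search is sketched below: a finite-difference gradient ascent on $D_B(\mathbf{t})$ with step halving, for the 4-QAM versus 16-QAM quadrature feature. Everything specific here is an assumption made for illustration: the mixture-of-Gaussians CDF, the step size, the iteration count, and the restriction of the search to a bounded interval, which is added to keep the covariance matrices well conditioned. It is not the procedure used to produce the results in this letter.

```python
import math
import numpy as np

def quadrature_cdf(t, order, sigma2):
    """CDF of one quadrature sample for unit-power square QAM in AWGN."""
    m = int(round(math.sqrt(order)))
    levels = np.arange(-(m - 1), m, 2, dtype=float)
    levels /= math.sqrt(2.0 * np.mean(levels ** 2))
    scale = math.sqrt(sigma2)
    return np.array([np.mean([0.5 * (1.0 + math.erf((ti - a) / scale))
                              for a in levels]) for ti in t])

def ecdf_moments(F, n_samples):
    """Asymptotic mean and covariance of the sampled ECDF."""
    lo, hi = np.minimum.outer(F, F), np.maximum.outer(F, F)
    return F, lo * (1.0 - hi) / n_samples

def bhattacharyya(t, n_samples, sigma2, orders=(4, 16)):
    mu1, s1 = ecdf_moments(quadrature_cdf(t, orders[0], sigma2), n_samples)
    mu2, s2 = ecdf_moments(quadrature_cdf(t, orders[1], sigma2), n_samples)
    s, d = 0.5 * (s1 + s2), mu1 - mu2
    ld, ld1, ld2 = (np.linalg.slogdet(a)[1] for a in (s, s1, s2))
    return float(0.125 * d @ np.linalg.solve(s, d) + 0.5 * (ld - 0.5 * (ld1 + ld2)))

def optimize_testpoints(t0, n_samples, sigma2, step=0.05, iters=100,
                        eps=1e-4, min_gap=1e-3, bound=2.5):
    """Gradient ascent on D_B(t) that keeps t sorted and inside [-bound, bound]."""
    t = np.sort(np.asarray(t0, dtype=float))
    best = bhattacharyya(t, n_samples, sigma2)
    for _ in range(iters):
        grad = np.array([
            (bhattacharyya(t + eps * e, n_samples, sigma2)
             - bhattacharyya(t - eps * e, n_samples, sigma2)) / (2.0 * eps)
            for e in np.eye(len(t))
        ])
        cand = t + step * grad / (np.linalg.norm(grad) + 1e-12)
        feasible = np.all(np.diff(cand) > min_gap) and np.all(np.abs(cand) <= bound)
        if feasible:
            val = bhattacharyya(cand, n_samples, sigma2)
            if val > best:
                t, best = cand, val
                continue
        step *= 0.5   # reject the move and shrink the step
    return t, best

t_init = np.array([-1.0, 1.0])
t_opt, d_opt = optimize_testpoints(t_init, n_samples=400, sigma2=1.0)
```

Because a move is accepted only when it increases $D_B$, the returned distance is never smaller than that of the starting point, but the result is a local maximum and depends on the initialization.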

III. Results and Discussion 
A. Testpoint Selection 

For the results section we focus on the quadrature feature,
which is a concatenation of the I and Q components of each
symbol. In Fig. 1 we show the results of the testpoint selection
procedure with $M = 200$, under 0 dB SNR, for a varying
number of testpoints, with the two classes being 4-QAM and
16-QAM. The solid line plot corresponds to the difference of
the two theoretical CDFs. We note that in the VD classifier
the local maxima and minima of this plot are used as the
testpoints. However, we find that the numerical optimization
finds "good" testpoints to be close to, but not exactly at, the local
maxima and minima. This is due to the additional information
provided by the covariance matrices.

In contrast to the VD classifier, which has a fixed number of
testpoints (4 for this particular problem) corresponding to
the number of local maxima and minima, the optimization
procedure allows more flexibility in choosing the number of
testpoints. In Fig. 1 we show the result of the optimization
procedure for a range of 1 to 8 testpoints. It confirms our
intuition that "good" testpoints tend to be 1) spaced apart to
avoid high correlation, 2) concentrated around locations that
have high CDF difference, and 3) not necessarily the same
for different values of $L$. This result further confirms the need
to jointly optimize the testpoint locations.

Fig. 2. Effect of increasing number of testpoints on $P_C$ for all possible pairs
of constellations of interest. The classification accuracy of both ML and VD
classifiers is also shown for comparison. (SNR = 0 dB, $M$ = 200)

Fig. 3. Comparison of the proposed Bayesian method with other existing
approaches under varying SNR with $M$ = 200 symbols used for classification.
The same number of testpoints is used for both VD and Bayesian.

B. Comparison With Existing Techniques 

As mentioned in the previous section, the proposed approach
has the flexibility of varying the number of testpoints.
This effectively gives more flexibility to trade off classification
accuracy against computational complexity. This idea is illustrated
in Fig. 2. For $N = 200$ and SNR = 0 dB, we show the
classification accuracy of the proposed method as the number
of testpoints is increased from 1 to 8, for all possible pairs
of $\mathcal{M}_k$. The dotted lines correspond to the accuracy of the
ML classifier, which serves as an upper bound on classification
accuracy, while the dashed lines correspond to that of the VD
classifier. Note that both are plotted as horizontal lines because
ML does not utilize testpoints, while VD has a fixed number
of testpoints corresponding to the pdf-crossings.

We see that the proposed method is able to exceed the
accuracy of the VD classifier with as few as 3 testpoints.
Further, the method's accuracy can be improved by adding
more testpoints, but at the cost of higher complexity. We also
note that with additional testpoints, the Bayesian classifier
reaches classification accuracy close to that of the ML classifier.

Finally, in Fig. 3 we compare the performance of the
proposed method with the existing techniques under varying
SNR with $M = 200$ symbols used for classification. To
have a fair comparison, the same number of testpoints is
used for both VD and Bayesian. For the entire range of
SNR the proposed Bayesian approach is shown to provide
substantial gains over the VD classifier. We emphasize again
that asymptotically, the proposed approach is the optimal
classifier when using the sampled distribution distance as
the discriminating feature. Also shown in the plot are the
classification accuracies of the ML classifier, which acts as the
upper bound, and the conventional Kuiper classifier.

IV. Conclusion 

In this letter we presented the optimal discriminant functions
for classification using the sampled distribution distance. This
method was shown to provide substantial gains compared to
other existing approaches. The performance of this method
was also shown to be close to that of the ML classifier but at
significantly lower computational complexity. Although modulation
classification is presented in this paper to illustrate the basic 
concept, the approach is not limited to this application. The 
same classifier can be generalized to any classification problem 
where the cdf of each class is available. 